In [1]:
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix
plt.close('all')
In [2]:
from IPython.display import display, HTML
HTML('''
<script
    src="https://cdnjs.cloudflare.com/ajax/libs/jquery/2.0.3/jquery.min.js ">
</script>
<script>
code_show=true; 
function code_toggle() {
 if (code_show){
 $('div.jp-CodeCell > div.jp-Cell-inputWrapper').hide();
 } else {
$('div.jp-CodeCell > div.jp-Cell-inputWrapper').show();
 }
 code_show = !code_show
} 
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit"
    value="Click here to toggle on/off the raw code."></form>
''')
Out[2]:
In [3]:
%run 'Lab_preprocess.ipynb'
In [8]:
%run 'Lab_RepresentativeClustering.ipynb'
In [6]:
%run 'Lab_Hierarchal.ipynb'
In [7]:
%run 'Lab_DensityClustering.ipynb'


I. Executive Summary


Our analysis delves into the World Development Indicators, a comprehensive dataset with diverse global data, focusing specifically on educational aspects. The primary objectives were to unearth patterns in global educational data, understand the diversity of educational systems, and relate foundational education to overall educational outcomes. The desired outcome was to categorize countries based on shared educational traits to aid in benchmarking and inform policy decisions.

Methodologically, we utilized K-Medoids for its robustness in handling data anomalies and Ward's method in hierarchical clustering to minimize the sum of squared differences within clusters. We also applied Density-Based Spatial Clustering of Applications with Noise (DBSCAN) for its efficacy in grouping data points by density.

Our representative clustering identified Morocco, Hungary, and Guinea as the cluster centers (medoids), reflecting a spectrum of global educational realities. Morocco and Hungary both show high enrollment rates but differ in student retention beyond primary education, especially for girls. Guinea represents more fundamental systemic challenges, evident in its low secondary enrollment and high repetition rates. These insights guide targeted recommendations: dropout prevention strategies for Morocco's cluster, equitable support across socio-economic backgrounds in Hungary's cluster, and significant systemic investments for Guinea's cluster.

A purity test combining hierarchical and DBSCAN methods gave us a robust score of 81.37%, affirming the effectiveness of both methods in clustering similar data points. Hierarchical clustering, especially with Ward Linkage, adeptly complemented DBSCAN, identifying its outliers as a distinct cluster. This synergy underscores the utility of employing both methods for an in-depth analysis.

In conclusion, our clusters broadly delineate more and less developed education systems, but it's vital to recognize the unique educational strengths and challenges within each country in a cluster. This nuanced understanding is essential for crafting effective educational strategies, especially in contexts similar to Guinea's cluster, as observed in the Philippines, highlighting the need for infrastructural and curricular reforms.


II. Introduction

| Column Name | Description |
|---|---|
| Country Code | Unique code for each country |
| Year | Year of data collection |
| Various Indicators | Different educational indicators like duration of compulsory education, intake ratios, school ages, enrollment percentages, etc. |
| Indicator Name | Descriptive name of the educational indicator |
| Indicator Code | Unique code for each educational indicator. These will serve as the features for our clustering. |



The table contains indicators that show how countries approach primary and secondary education. It includes how long children must stay in school, how many boys and girls are enrolled, the ages at which they start, and whether students are older than usual for their grade or repeating a year.
| Features | Description |
|---|---|
| SE.COM.DURS | Compulsory education, duration (years) |
| SE.ENR.PRIM.FM.ZS | School enrollment, primary (gross), gender parity index (GPI) |
| SE.ENR.SECO.FM.ZS | School enrollment, secondary (gross), gender parity index (GPI) |
| SE.PRM.AGES | Primary school starting age (years) |
| SE.PRM.DURS | Primary education, duration (years) |
| SE.PRM.ENRL.FE.ZS | Primary education, pupils (% female) |
| SE.PRM.ENRR | School enrollment, primary (% gross) |
| SE.PRM.ENRR.FE | School enrollment, primary, female (% gross) |
| SE.PRM.ENRR.MA | School enrollment, primary, male (% gross) |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female (% of relevant age group) |
| SE.PRM.GINT.MA.ZS | Gross intake ratio in first grade of primary education, male (% of relevant age group) |
| SE.PRM.GINT.ZS | Gross intake ratio in first grade of primary education, total (% of relevant age group) |
| SE.PRM.NENR | School enrollment, primary (% net) |
| SE.PRM.OENR.FE.ZS | Over-age students, primary, female (% of female enrollment) |
| SE.PRM.OENR.MA.ZS | Over-age students, primary, male (% of male enrollment) |
| SE.PRM.OENR.ZS | Over-age students, primary (% of enrollment) |
| SE.PRM.PRIV.ZS | School enrollment, primary, private (% of total primary) |
| SE.PRM.REPT.ZS | Repeaters, primary, total (% of total enrollment) |
| SE.SEC.AGES | Lower secondary school starting age (years) |
| SE.SEC.DURS | Secondary education, duration (years) |
| SE.SEC.ENRL.FE.ZS | Secondary education, pupils (% female) |
| SE.SEC.ENRL.GC.FE.ZS | Secondary education, general pupils (% female) |
| SE.SEC.ENRR | School enrollment, secondary (% gross) |
| SE.SEC.ENRR.FE | School enrollment, secondary, female (% gross) |
| SE.SEC.ENRR.MA | School enrollment, secondary, male (% gross) |
| SE.SEC.GPI | School enrollment, secondary (gross), gender parity index (GPI) |


III. Methodology


1. Data Preprocessing

DATASET LIMITATIONS AND ASSUMPTIONS

  1. The dataset contains many null values, so we had to reduce the number of countries to 102.
  2. All features are directly related to education according to WDI's own classification.
  3. All features are ratios or percentages (for example, of GDP or of total population); we avoided absolute values as much as possible so that countries could be compared effectively.

1. WDI.db contains all indicators related to education, based on the descriptions and types that WDI assigns to each indicator.

In [9]:
df_ed.tail()
Out[9]:
Indicator Name
139 trained teachers in secondary education, femal...
140 trained teachers in secondary education, male ...
141 trained teachers in upper secondary education ...
142 trained teachers in upper secondary education,...
143 trained teachers in upper secondary education,...

Here we limited the Indicator Name to just those related to education.

In [10]:
df_main.head()
Out[10]:
Country Name Country Code Indicator Name Indicator Code 1960 1961 1962 1963 1964 1965 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 Unnamed: 66
0 africa eastern and southern AFE access to clean fuels and technologies for coo... EG.CFT.ACCS.ZS NaN NaN NaN NaN NaN NaN ... 16.936004 17.337896 17.687093 18.140971 18.491344 18.825520 19.272212 19.628009 NaN NaN
1 africa eastern and southern AFE access to clean fuels and technologies for coo... EG.CFT.ACCS.RU.ZS NaN NaN NaN NaN NaN NaN ... 6.499471 6.680066 6.859110 7.016238 7.180364 7.322294 7.517191 7.651598 NaN NaN
2 africa eastern and southern AFE access to clean fuels and technologies for coo... EG.CFT.ACCS.UR.ZS NaN NaN NaN NaN NaN NaN ... 37.855399 38.046781 38.326255 38.468426 38.670044 38.722783 38.927016 39.042839 NaN NaN
3 africa eastern and southern AFE access to electricity (% of population) EG.ELC.ACCS.ZS NaN NaN NaN NaN NaN NaN ... 31.794160 32.001027 33.871910 38.880173 40.261358 43.061877 44.270860 45.803485 NaN NaN
4 africa eastern and southern AFE access to electricity, rural (% of rural popul... EG.ELC.ACCS.RU.ZS NaN NaN NaN NaN NaN NaN ... 18.663502 17.633986 16.464681 24.531436 25.345111 27.449908 29.641760 30.404935 NaN NaN

5 rows × 67 columns

You will also notice that Country Name contains regions that represent multiple countries, such as africa eastern and southern, africa western and central, and arab world, as well as aggregate groupings such as low-income countries.


2. Merge the indicators and indicator descriptions with the main dataset.

In [11]:
df_test.head()
Out[11]:
Indicator Name Country Name Country Code Indicator Code 1960 1961 1962 1963 1964 1965 ... 2013 2014 2015 2016 2017 2018 2019 2020 2021 Unnamed: 66
0 adjusted net enrollment rate, primary (% of pr... africa eastern and southern AFE SE.PRM.TENR NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 adjusted net enrollment rate, primary (% of pr... africa western and central AFW SE.PRM.TENR NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN NaN NaN
2 adjusted net enrollment rate, primary (% of pr... arab world ARB SE.PRM.TENR NaN NaN NaN NaN NaN NaN ... 84.21832 84.25430 84.03523 84.53258 85.14375 85.38422 NaN NaN NaN NaN
3 adjusted net enrollment rate, primary (% of pr... caribbean small states CSS SE.PRM.TENR NaN NaN NaN NaN NaN NaN ... 89.77977 89.57198 90.92441 90.48512 89.39624 88.92917 NaN NaN NaN NaN
4 adjusted net enrollment rate, primary (% of pr... central europe and the baltics CEB SE.PRM.TENR NaN NaN NaN NaN NaN NaN ... 94.01037 93.41415 93.45411 93.06906 92.81936 91.02484 NaN NaN NaN NaN

5 rows × 67 columns


3. Put Indicator Codes in their respective columns since they will be our features.

4. We chose 2011 because it is the most complete year (the year with the fewest null values).

5. Limit the Country Code to actual countries, since the previous table includes regions and other aggregates that represent multiple countries.

Since the dataset initially had years as columns and indicators as rows, we swap them so that each indicator becomes a column, and we drop the unnecessary columns.
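The reshaping described above can be sketched with pandas. This is a minimal illustration on a made-up two-country table, not the notebook's actual preprocessing code; the helper name `features_for_year` and the toy values are assumptions.

```python
import pandas as pd

# Toy stand-in for the wide WDI layout: one row per (country, indicator),
# one column per year. All values are made up for illustration.
wide = pd.DataFrame({
    "Country Code": ["AAA", "AAA", "BBB", "BBB"],
    "Indicator Code": ["SE.PRM.AGES", "SE.PRM.DURS"] * 2,
    "2010": [6.0, 6.0, 7.0, 4.0],
    "2011": [6.0, 6.0, 7.0, 5.0],
})

def features_for_year(df: pd.DataFrame, year: str) -> pd.DataFrame:
    """Return one row per country and one column per indicator code."""
    out = df.pivot(index="Country Code", columns="Indicator Code", values=year)
    out.columns.name = None            # drop the leftover axis label
    out.insert(0, "Year", int(year))   # keep the year as an explicit column
    return out.reset_index()

df_year = features_for_year(wide, "2011")
print(df_year)
```

Each indicator code ends up as its own column, which is the layout the clustering steps expect.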

In [12]:
df_2011.head()
Out[12]:
Country Code Year SE.ADT.1524.LT.FE.ZS SE.ADT.1524.LT.FM.ZS SE.ADT.1524.LT.MA.ZS SE.ADT.1524.LT.ZS SE.ADT.LITR.FE.ZS SE.ADT.LITR.MA.ZS SE.ADT.LITR.ZS SE.COM.DURS ... SE.XPD.CTER.ZS SE.XPD.CTOT.ZS SE.XPD.PRIM.PC.ZS SE.XPD.PRIM.ZS SE.XPD.SECO.PC.ZS SE.XPD.SECO.ZS SE.XPD.TERT.PC.ZS SE.XPD.TERT.ZS SE.XPD.TOTL.GB.ZS SE.XPD.TOTL.GD.ZS
41 ABW 2011 NaN NaN NaN NaN NaN NaN NaN 13.0 ... 100.000000 100.000000 NaN NaN NaN NaN NaN NaN 21.750540 6.11913
143 AFG 2011 32.113220 0.51897 61.879070 46.990051 17.017839 45.417099 31.448851 9.0 ... 77.549629 82.625092 12.21159 61.97491 12.56712 26.62432 96.09478 8.98621 16.048429 3.46201
245 AGO 2011 NaN NaN NaN NaN NaN NaN NaN 6.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 8.940000 3.03000
296 ALB 2011 98.856239 1.00126 98.731361 98.791191 95.691483 98.008163 96.845299 8.0 ... NaN NaN NaN NaN NaN NaN NaN NaN 12.840000 3.08000
347 AND 2011 NaN NaN NaN NaN NaN NaN NaN 10.0 ... 94.550323 98.824097 17.79961 28.80875 13.52729 21.35985 NaN 3.84056 NaN 2.98706

5 rows × 139 columns


6. Remove nulls by setting thresholds within which we are comfortable imputing. In this case, it is acceptable to drop countries that have too many null values in the features we selected. Since we wanted to retain as many features as we could, we opted for high thresholds on the features.

In [13]:
print("After cleaning, we are left with: ", df_country.shape)
After cleaning, we are left with:  (102, 27)
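A threshold-based cleanup of this kind might look like the sketch below. The report does not state the thresholds it used, so `col_thresh` and `row_thresh` are hypothetical values, and `drop_sparse` is an illustrative helper rather than the notebook's function.

```python
import numpy as np
import pandas as pd

def drop_sparse(df, col_thresh=0.75, row_thresh=0.9):
    """Keep columns, then rows, whose share of non-null values meets a threshold."""
    keep_cols = df.columns[df.notna().mean(axis=0) >= col_thresh]
    df = df[keep_cols]
    keep_rows = df.notna().mean(axis=1) >= row_thresh
    return df.loc[keep_rows]

toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0, 5.0],
    "b": [1.0, np.nan, 3.0, 4.0, 5.0],          # 80% complete: kept
    "c": [np.nan, np.nan, np.nan, 4.0, 5.0],    # 40% complete: dropped
})
cleaned = drop_sparse(toy)
print(cleaned.shape)
```

Filtering columns first and rows second means a sparse feature does not cause otherwise complete countries to be dropped.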

7. We performed imputation using IterativeImputer with a LinearRegression() estimator.

In [14]:
print(f'Current Number of Nulls: {X.isnull().sum().sum()}')
X.tail()
Current Number of Nulls: 0
Out[14]:
SE.COM.DURS SE.ENR.PRIM.FM.ZS SE.ENR.SECO.FM.ZS SE.PRM.AGES SE.PRM.DURS SE.PRM.ENRL.FE.ZS SE.PRM.ENRR SE.PRM.ENRR.FE SE.PRM.ENRR.MA SE.PRM.GINT.FE.ZS ... SE.PRM.OENR.ZS SE.PRM.PRIV.ZS SE.PRM.REPT.ZS SE.SEC.AGES SE.SEC.DURS SE.SEC.ENRL.FE.ZS SE.SEC.ENRL.GC.FE.ZS SE.SEC.ENRR SE.SEC.ENRR.FE SE.SEC.ENRR.MA
97 8.0 0.99054 0.917890 6.0 5.0 48.77150 101.049561 100.559967 101.520119 100.08062 ... 5.76218 11.381853 2.35584 11.0 7.0 47.037630 48.466150 88.338303 84.496590 92.055481
98 11.0 1.01228 0.970870 6.0 4.0 48.87233 100.632607 101.264503 100.035919 105.36214 ... 7.94891 0.539720 0.06396 10.0 7.0 48.045860 49.270140 93.503242 92.088188 94.851112
99 14.0 0.97039 1.542867 6.0 6.0 48.20959 113.254959 111.518219 114.920952 100.75008 ... 9.80920 16.491699 5.38069 12.0 6.0 61.623135 60.108850 105.427803 127.998695 83.583115
100 12.0 0.97978 0.992780 7.0 4.0 48.34564 94.538162 93.550720 95.481422 94.23476 ... 1.02560 3.002763 0.00431 11.0 8.0 48.602320 50.412313 90.215843 89.881271 90.534523
101 14.0 0.97535 1.093720 6.0 6.0 48.26757 102.418861 101.112923 103.668114 95.77170 ... 9.40757 17.549299 3.52591 12.0 5.0 51.212060 51.295100 83.746429 87.575607 80.071411

5 rows × 25 columns
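As a rough illustration of the imputation step, the sketch below runs scikit-learn's `IterativeImputer` with a `LinearRegression()` estimator on synthetic data. The data, `max_iter`, and `random_state` are assumptions; the notebook's actual settings are not shown in this report.

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (enables the next import)
from sklearn.impute import IterativeImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x1 = rng.normal(size=50)
# Column 1 is an exact linear function of column 0; column 2 is unrelated noise.
X_toy = np.column_stack([x1, 2.0 * x1 + 1.0, rng.normal(size=50)])
X_missing = X_toy.copy()
X_missing[[3, 10, 25], 1] = np.nan   # knock out three known values

imputer = IterativeImputer(estimator=LinearRegression(), max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)
print(np.isnan(X_filled).sum())
```

Each feature with missing entries is regressed on the other features, so the quality of the filled values depends on how linearly related the indicators are.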

8. We conduct PCA on the 25 education features.

We set n_components to 7, which already explains 99% of the variance.

In [15]:
country_pca.head()
Out[15]:
0 1 2 3 4 5 6
0 13.768181 -15.374378 -10.625675 2.439129 -5.916001 0.612051 3.656233
1 30.939147 23.383609 11.415078 -11.658356 -3.554324 -1.906466 -0.194884
2 19.428307 -2.824350 -14.371960 7.583378 2.049612 -1.291544 0.425541
3 25.801059 -7.061240 -9.232107 -4.382353 -12.541582 6.097702 3.665695
4 1.692947 -15.695533 -15.577955 2.456936 -2.833778 0.855853 1.027908
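The choice of 7 components can be checked by looking at the cumulative explained variance. The sketch below uses random data with the same shape as the feature matrix (102 rows, 25 columns); the data and the column scaling are assumptions made only so the example runs on its own.

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
# Synthetic stand-in for the 102 x 25 feature matrix, with unequal column scales.
X_toy = rng.normal(size=(102, 25)) * np.linspace(10.0, 0.1, 25)

pca = PCA(n_components=7)
scores = pca.fit_transform(X_toy)            # one row per country, one column per PC
cum_var = np.cumsum(pca.explained_variance_ratio_)
print(scores.shape)
print(cum_var.round(3))
```

On the real data, the last entry of `cum_var` is the figure that would correspond to the 99% quoted above.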


2. Representative Clustering: K-Medoids

K-Medoids is a clustering algorithm that, unlike K-Means, chooses actual data points as cluster centers, which are known as medoids. This method is particularly robust as it selects the most centrally located point in a cluster, ensuring that the centers are representative of the actual data distribution and not skewed by outliers.

The algorithm is as follows:

Algorithm GenericMedoids(Database: $D$, Number of representatives: $k$)
begin
Initialize representative set $S$ by selecting from $D$;
repeat
$\quad$Create clusters ($C_1, ..., C_k$) by assigning each point in $D$ to closest representative in $S$ using the distance function $Dist(\cdot,\cdot)$
$\quad$Determine a pair $x_i \in D$ and $y_j \in S$ such that replacing $y_j$ with $x_i$ leads to the greatest possible improvement in objective function
$\quad$Perform the exchange between $x_i$ and $y_j$ only if improvement is positive
until no improvement in current iteration;
return $(C_1, ..., C_k)$;
end
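The pseudocode above can be sketched in NumPy as follows. This is a small, brute-force illustration for toy data, assuming Euclidean distance for $Dist(\cdot,\cdot)$ and random initialization; it is not the implementation used in the notebook, and the function name `k_medoids` is hypothetical.

```python
import numpy as np

def k_medoids(X, k, seed=0, max_iter=100):
    """Swap-based K-Medoids on a small dataset; returns (medoid indices, labels)."""
    rng = np.random.default_rng(seed)
    n = len(X)
    # Pairwise Euclidean distances play the role of Dist(.,.).
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    medoids = rng.choice(n, size=k, replace=False)
    cost = dist[:, medoids].min(axis=1).sum()
    for _ in range(max_iter):
        best_cost, best_swap = cost, None
        for j in range(k):                 # representative y_j to replace
            for i in range(n):             # candidate x_i to bring in
                if i in medoids:
                    continue
                trial = medoids.copy()
                trial[j] = i
                trial_cost = dist[:, trial].min(axis=1).sum()
                if trial_cost < best_cost:
                    best_cost, best_swap = trial_cost, (j, i)
        if best_swap is None:              # no positive improvement: stop
            break
        medoids[best_swap[0]] = best_swap[1]
        cost = best_cost
    labels = dist[:, medoids].argmin(axis=1)
    return medoids, labels

rng = np.random.default_rng(1)
# Three well-separated toy blobs of 20 points each.
X_toy = np.vstack([rng.normal(c, 0.3, size=(20, 2)) for c in (0.0, 5.0, 10.0)])
medoids, labels = k_medoids(X_toy, k=3)
print(medoids, np.bincount(labels))
```

Because every medoid is a row of the data, each cluster center can be read off directly as an observed point, which is what makes the later "representative country" interpretation possible.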



Opting for K-Medoids over K-Means is advisable when the dataset contains anomalies or outliers, as K-Medoids is less sensitive to such variations. In the context of our dataset, which likely includes noise and outliers, K-Medoids provides a more reliable clustering by anchoring the clusters to genuine, observed data points, making the resulting groupings more interpretable and applicable to real-world scenarios.

To choose the optimal number of clusters, we employ the following measures:

  • Sum of squared distances to the cluster center (SSD):
    Smaller values suggest tighter clusters, where $\mu_i$ is the center of cluster $C_i$. $$ \text{SSD} = \sum_{i=1}^{k} \sum_{x \in C_i} \|x - \mu_i\|^2 $$
  • Calinski-Harabasz index (CH):
    The higher the value of this measure, the more defined the clusters are. Here $B_k$ is the between-cluster dispersion, $W_k$ is the within-cluster dispersion, and $n$ is the number of points. $$s_k = \frac{B_k/(k-1)}{W_k/(n-k)}$$
  • Silhouette coefficient (SC):
    Values range from -1 to 1; a value from 0.5 up to (but below) 1 indicates more clearly defined clusters. $$ S_i = \frac{Dmin^{out}_i - Davg^{in}_i}{\max\{Dmin^{out}_i, Davg^{in}_i\}}$$
  • Davies-Bouldin Index (DB):
    Small values of $DB$ imply compact and well-separated clusters. $$DB = \frac{1}{k} \sum_i R_i = \frac{1}{k} \sum_i \max_{j \neq i}{R_{ij}}$$
  • Gap Statistic (GS):
    We look for the $k$ just before the curve changes abruptly. Here $\bar{W}_{k,i}$ is the within-cluster dispersion of the $i$-th of $b$ reference datasets and $\bar{W}_k$ is that of the observed data. $$\text{Gap}_n(k) = \frac{1}{b} \sum_{i=1}^{b} \log(\bar{W}_{k,i}) - \log(\bar{W}_k)$$
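Most of the criteria above are available in scikit-learn. The sketch below evaluates them over a range of $k$ on synthetic blobs; it uses `KMeans` as a stand-in clusterer (an assumption, since the report uses K-Medoids) and omits the gap statistic, for which scikit-learn has no built-in function.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import (calinski_harabasz_score, davies_bouldin_score,
                             silhouette_score)

# Synthetic data with three groups; a stand-in for the PCA-transformed features.
X_toy, _ = make_blobs(n_samples=150, centers=3, cluster_std=0.5, random_state=0)

scores = {}
for k in range(2, 7):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_toy)
    labels = model.labels_
    scores[k] = {
        "SSD": model.inertia_,                        # sum of squared distances to centers
        "CH": calinski_harabasz_score(X_toy, labels),
        "SC": silhouette_score(X_toy, labels),
        "DB": davies_bouldin_score(X_toy, labels),
    }

for k, row in scores.items():
    print(k, {name: round(float(value), 3) for name, value in row.items()})
```

Reading the four columns side by side is the tabular counterpart of the internal-validation plot used later to pick $k$.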

3. Hierarchical Clustering: Ward's Method

Ward's Method defines the distance between two clusters, $A$ and $B$, as the amount the sum of squares will increase when we merge them:

$$ \Delta(A, B) = \sum_{i \in A \cup B} \|x_i - m_{A \cup B}\|^2 - \sum_{i \in A} \|x_i - m_A\|^2 - \sum_{i \in B} \|x_i - m_B\|^2 = \frac{n_A n_B}{n_A + n_B} \|m_A - m_B\|^2 $$

where:

  • $m_j$ is the center of cluster $j$
  • $n_j$ is the number of points in it
  • $\Delta$ is the merging cost of combining the clusters $A$ and $B$.

Starting with every point as its own cluster, the method repeatedly merges the pair of clusters with the smallest $\Delta$. Because the factor $\frac{n_A n_B}{n_A + n_B}$ grows with cluster size, given two pairs of clusters whose centers are equally far apart, Ward's method will prefer to merge the smaller ones.

Using Ward's method for our dataset is a strategic choice because it helps us form clusters by minimizing the sum of squared differences within all clusters. This means that countries grouped together have similar educational metrics, which is exactly what we're looking for. By applying Ward's method, we can confidently say that countries within each cluster have comparable education indicators, making it easier for us to draw insights and make recommendations. It's especially useful for spotting distinct patterns or outliers in our data, guiding us in providing tailored suggestions for educational improvements.
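The merging cost can be computed directly from its definition as an increase in the sum of squares, and it agrees with the standard closed form $\frac{n_A n_B}{n_A + n_B}\|m_A - m_B\|^2$. The sketch below checks this numerically on random toy clusters; the function names are illustrative.

```python
import numpy as np

def sse(points):
    """Sum of squared distances of points to their own mean."""
    return float(((points - points.mean(axis=0)) ** 2).sum())

def ward_cost(A, B):
    """Increase in the total within-cluster sum of squares caused by merging A and B."""
    return sse(np.vstack([A, B])) - sse(A) - sse(B)

def ward_cost_closed_form(A, B):
    """Equivalent form: n_A * n_B / (n_A + n_B) * ||m_A - m_B||^2."""
    n_a, n_b = len(A), len(B)
    gap = A.mean(axis=0) - B.mean(axis=0)
    return float(n_a * n_b / (n_a + n_b) * (gap @ gap))

rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(5, 3))
B = rng.normal(4.0, 1.0, size=(30, 3))
print(ward_cost(A, B), ward_cost_closed_form(A, B))
```

The size factor in the closed form is why, for equally distant centers, merging two small clusters costs less than merging two large ones.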

4. Density-Based Clustering: Density-based Spatial Clustering of Applications with Noise (DBSCAN)

DBSCAN is a clustering algorithm that groups points based on density. It labels each point in a dataset as a core point, border point, or noise.

  • core point: a point that has at least $MinPts$ points in its $\epsilon$-neighborhood
  • border point: a point that has fewer than $MinPts$ neighbors but has a core point as a neighbor
  • noise point: a point that has fewer than $MinPts$ neighbors and does not have a core point as a neighbor

The choice of $\epsilon$ and $MinPts$ is crucial. For example, in a spatial dataset where the maximum interaction distance is 100 meters, setting $\epsilon$ to 100 meters and $MinPts$ to 3 would group points into a cluster only if at least three of them lie within this range. This method effectively separates denser regions (clusters) from less dense areas (noise).

Input:

  • $D$: a dataset containing $n$ objects
  • $\epsilon$ : the radius parameter
  • $MinPts$: the neighborhood density threshold

Output: A set of density-based clusters
Method:
mark all objects as unvisited;
    do
        randomly select an unvisited object $p$;
        mark $p$ as visited;
        if the $\epsilon$-neighborhood of $p$ has at least $MinPts$ objects
            create new cluster $C$, and add $p$ to $C$;
            let $N$ be the set of objects in the $\epsilon$-neighborhood of $p$;
            for each point $p'$ in $N$
                if $p'$ is unvisited
                    mark $p'$ as visited;
                    if the $\epsilon$-neighborhood of $p'$ has at least $MinPts$ points,
                        add those points to $N$;
                if $p'$ is not yet a member of any cluster, add $p'$ to $C$;
            end for
            output $C$;
        else mark $p$ as noise;
    until no object is unvisited;
end

DBSCAN is a practical choice for our dataset because it automatically detects the number of clusters and deals well with outliers, labeling sparse points as noise. Its ability to handle arbitrary cluster shapes and sizes makes it versatile, especially since we do not need to specify in advance how many clusters we expect. Its robustness to anomalies helps ensure that the clusters formed represent significant trends in our data, making it a suitable tool for analyzing educational datasets that may contain irregularities or unusual patterns.

For this report, the hyperparameters were chosen by experimenting with different values: eps is set to 38 and min_samples to 11.
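A minimal sketch of how DBSCAN labels dense and sparse points with scikit-learn is shown below. The toy data and the values `eps=1.0` and `min_samples=5` are assumptions chosen for this example; the report's own values (38 and 11) only make sense on the scale of its PCA-transformed data.

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
dense = rng.normal(0.0, 0.3, size=(40, 2))        # one dense blob
far = np.array([[10.0, 10.0], [-10.0, 10.0]])     # two isolated points
X_toy = np.vstack([dense, far])

labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X_toy)

# scikit-learn marks noise points with the label -1.
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int((labels == -1).sum())
print(n_clusters, n_noise)
```

The number of clusters is an output rather than an input here, which is the property the paragraph above relies on.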

5. External Validation: Cluster Purity Test

The cluster purity test is an external validation criterion that measures how well the clusters produced by one algorithm agree with those produced by another. It is given by: $$ \text{Purity} = \frac{\sum_{j=1}^{k_d} P_j}{\sum_{j=1}^{k_d} M_j}. $$

Where,
$$ \begin{aligned} N_i &= \sum_{j=1}^{k_d} m_{ij} \\ M_j &= \sum_{i=1}^{k_t} m_{ij} \\ P_j &= \max_i m_{ij} \end{aligned} $$

$m_{ij}$ is the number of data points that are mapped from cluster $i$ to cluster $j$
$N_i$ and $M_j$ are the numbers of data points in clusters $i$ and $j$, respectively
$P_j$ is the number of data points in cluster $j$ that come from its dominant cluster $i$
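The purity formula can be computed from the contingency table of two labelings. The sketch below uses `confusion_matrix` (already imported at the top of the notebook) to build $m_{ij}$; the label vectors are made up, and treating DBSCAN's noise label `-1` as its own cluster is an assumption for illustration.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

def purity(labels_a, labels_b):
    """Purity of clustering b measured against clustering a."""
    # m[i, j]: number of points in cluster i of a and cluster j of b.
    m = confusion_matrix(labels_a, labels_b)
    # P_j is the column-wise maximum; the sum of all M_j is the total count.
    return m.max(axis=0).sum() / m.sum()

labels_a = np.array([0, 0, 0, 1, 1, 1, 1, 0])      # e.g. hierarchical labels
labels_b = np.array([0, 0, 1, 1, 1, 1, -1, -1])    # e.g. DBSCAN labels, -1 = noise
print(purity(labels_a, labels_b))
```

For these toy labels the column maxima are 1, 2, and 3 out of 8 points, giving a purity of 0.75.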


IV. Results and Discussion


I. Representative Clustering

Choosing the optimized $k$ for K-Medoids Clustering

In [16]:
medoids_plot_internal
Out[16]:
No description has been provided for this image


Interpretation

1. The SSE (sum of squared errors, the SSD defined in the methodology) decreases consistently, which suggests that adding more clusters will always improve the fit but may lead to overfitting.

2. The CH (Calinski-Harabasz) index stays flat at 0, so it does not point to a particular $k$.

3. The SC (Silhouette coefficient) doesn't show a strong peak, which typically would indicate an optimal number of clusters. In this case, the $k$ whose value is closest to 0.5 is 3.

4. The DB (Davies-Bouldin) index, which should be minimized, is lowest at $k=3$, which aligns well with the SC.

5. The Gap statistic compares the within-cluster dispersion with that expected under a null reference distribution of the data. We look for the $k$ that produces the biggest change; in this case, that is 4, followed by 3.

We would choose a $k$ of either 3 or 4; for parsimony's sake, we selected 3.

K-Medoids Scatter Plot

In [17]:
medscatter
Out[17]:
No description has been provided for this image
Looking at the 2D visualization, it may seem like the centroids are misplaced, especially the yellow centroid. This is because we are viewing only two dimensions instead of all PCs. Overall, though, the plot shows little compactness except in the teal cluster, clear separation except between the violet and teal clusters, and imbalanced cluster sizes.
In [18]:
medoids_3d

Centroids of each cluster in K-Medoids

We use this step to identify which countries sit at the center of each cluster produced by our K-Medoids model. We compute the distances between each country's data point (after the PCA transformation) and the cluster centers, then pick the country closest to each center. Because K-Medoids centers are actual observations (medoids), each center corresponds to exactly one country. These countries are what we refer to as our centroids.

Once we have these indices, they tell us which countries are the most representative of their respective clusters. It implies that these countries' educational indicators, the features we've analyzed, are central to the characteristics that define each cluster. Essentially, we're identifying the countries that best embody the common traits of their group, which helps us understand the distinct educational profiles that exist across the globe.
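The lookup described above amounts to a nearest-row search. The sketch below is a NumPy illustration on made-up points; `closest_rows` is a hypothetical helper, not the notebook's code.

```python
import numpy as np

def closest_rows(points, centers):
    """Index of the row in `points` nearest (Euclidean) to each row of `centers`."""
    diffs = points[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)      # shape (n_points, n_centers)
    return dists.argmin(axis=0)

points = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [6.0, 5.0], [10.0, 0.0]])
centers = points[[1, 2, 4]]                    # medoids are rows of the data
idx = closest_rows(points, centers)
print(idx)
```

The returned indices can then be used to select the matching rows of the original feature table, which is how the representative countries are read off.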

In [19]:
selected_rows
Out[19]:
SE.COM.DURS SE.ENR.PRIM.FM.ZS SE.ENR.SECO.FM.ZS SE.PRM.AGES SE.PRM.DURS SE.PRM.ENRL.FE.ZS SE.PRM.ENRR SE.PRM.ENRR.FE SE.PRM.ENRR.MA SE.PRM.GINT.FE.ZS ... SE.PRM.OENR.ZS SE.PRM.PRIV.ZS SE.PRM.REPT.ZS SE.SEC.AGES SE.SEC.DURS SE.SEC.ENRL.FE.ZS SE.SEC.ENRL.GC.FE.ZS SE.SEC.ENRR SE.SEC.ENRR.FE SE.SEC.ENRR.MA
63 9.0 0.94839 0.86277 6.0 6.0 47.41941 110.742477 107.737221 113.600243 104.13425 ... 13.22256 11.76751 8.29433 12.0 6.0 45.28152 45.70542 66.517967 61.522942 71.309036
46 13.0 0.99531 0.98200 7.0 4.0 48.42716 100.015503 99.773361 100.243927 97.84767 ... 3.07440 9.19171 1.78200 11.0 8.0 48.25716 50.23067 96.799271 95.897491 97.655731
39 6.0 0.85185 0.62778 7.0 6.0 45.60272 87.285530 80.251091 94.208328 89.27442 ... 11.35905 28.89716 12.66202 13.0 7.0 38.23318 37.86940 37.603031 28.957790 46.127239

3 rows × 25 columns

In [20]:
centroids_plot()
Out[20]:
No description has been provided for this image

Given that we're looking at 2011 education data, we examine indicators that clearly distinguish the centroid countries from one another in terms of their educational systems and situations. These indicators give us a snapshot of each country's approach to education; to research them further, we dive deeper into the broader societal and cultural contexts.

Morocco (MAR):

| Indicator Code | Indicator Name | Description | References |
|---|---|---|---|
| SE.PRM.ENRR | School enrollment, primary, % gross | Morocco's high rate (over 110%) indicates efforts to enroll all eligible children, including those outside the typical age range, possibly due to late starts or re-enrollment. However, this rate does not necessarily translate to completion rates or quality education. | Scholaro |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female, % of relevant age group | The figure surpasses 104, suggesting a commendable effort toward gender equality at the entry level of education. | World Bank |
| SE.PRM.NENR | School enrollment, primary, % net | A net enrollment rate around 93% signifies a strong drive to ensure children of official primary school age are attending school, although issues of quality can undermine the benefits of this high enrollment. | Broken Chalk |
| SE.PRM.REPT.ZS | Repeaters, primary, % of total enrollment | At about 8.3%, the repetition rate is lower than Guinea's (about 12.7%) but well above Hungary's (about 1.8%), suggesting that most, though not all, students are progressing through grades as expected. | - |

Hungary (HUN):

| Indicator Code | Indicator Name | Description | References |
|---|---|---|---|
| SE.SEC.ENRL.GC.FE.ZS | Secondary education, general pupils, % female | Girls make up about 50% of Hungary's general secondary pupils, and gross secondary enrollment (SE.SEC.ENRR) is nearly 97%, indicating a high level of gender parity. However, socio-economic status significantly impacts educational participation. | OECD |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | With a low value around 9%, Hungary's reliance on public education over private indicates a strong state educational system. | European Commission |
| SE.PRM.REPT.ZS | Repeaters, primary, % of total enrollment | The low repetition rate around 1.8% in primary education suggests efficiency within the educational pathway. | - |
| SE.SEC.DURS | Secondary education, duration, years | The longer duration of secondary education in Hungary reflects its in-depth educational approach. | - |

Guinea (GIN):

| Indicator Code | Indicator Name | Description | References |
|---|---|---|---|
| SE.SEC.ENRR | School enrollment, secondary, % gross | At about 37%, Guinea's rate is significantly lower than that of Hungary and Morocco, reflecting challenges in progressing from primary to secondary education. This aligns with insights that highlight Guinea's struggles with literacy rates and educational access, particularly for girls. | Broken Chalk |
| SE.PRM.OENR.ZS | Over-age students, primary, % of enrollment | The high percentage (over 11%) indicates that a significant number of students are older than the typical age for their grade level, which may imply interruptions in educational progression and systemic issues within the education sector. | - |
| SE.PRM.REPT.ZS | Repeaters, primary, % of total enrollment | The high value (over 12%) suggests inefficiencies, with many students needing to repeat grades, possibly due to quality of education or socio-economic factors. | - |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | A higher reliance on private primary education could reflect gaps in the public education system. | - |

Range Plot for different features in K-Medoids

In [21]:
med_range
Out[21]:
No description has been provided for this image
| Indicator Code | Indicator Name | Analysis for Cluster 0 | Analysis for Cluster 1 | Analysis for Cluster 2 |
|---|---|---|---|---|
| SE.COM.DURS | Compulsory education duration | Ranges from 6 to 16 years, averaging nearly 10 years | Requires about 13 years | Centers around 6 years |
| SE.ENR.PRIM.FM.ZS | Gross intake ratio in first grade of primary education, female | High and practically universal | High and practically universal | Lowest, indicating barriers to female education access |
| SE.ENR.SECO.FM.ZS | Gross intake ratio in first grade of secondary education, male | Majority of eligible boys are enrolling | Majority of eligible boys are enrolling | Lower and more variable |
| SE.PRM.AGES | Primary school starting age | Approximately age 6 | Approximately age 6 | Somewhat younger onset |
| SE.PRM.DURS | Primary education duration | 6-year average | Longer than the 6-year average | 6-year average |
| SE.PRM.ENRL.FE.ZS | Primary education, pupils % female | High percentage of female students | High percentage of female students | Lower mean, suggesting gender disparities |
| SE.PRM.ENRR | School enrollment, primary % gross | Reaching 100%, indicating over-enrollment | High enrollment rates | Lower mean, indicating under-enrollment |
| SE.PRM.ENRR.FE | School enrollment, primary, female % gross | Significant female enrollment rates | Significant female enrollment rates | More variable and usually lower rate |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female | High intake ratio, nearly universal coverage | Over-enrollment of females | Lower mean, indicating enrollment issues |
| SE.PRM.NENR | School enrollment, primary % net | Almost universal net enrollment | Lower average | The lowest, suggesting high out-of-school youth |
| SE.PRM.OENR.ZS | Over-age students, primary % of enrollment | Lower average of over-aged pupils | Greater rates, showing age-grade disparities | Greater rates, showing age-grade disparities |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | - | - | - |
| SE.PRM.REPT.ZS | Repeaters, primary % of total enrollment | Low repeater rate, effective grade advancement | Greater rates, more children repeating grades | Greater rates, more children repeating grades |
| SE.SEC.ENRL.FE.ZS | Secondary education, general pupils % female | - | Highest average secondary enrollment rate for females | - |
| SE.SEC.ENRR | School enrollment, secondary % gross | High average rate, successful primary to secondary transition | Lower average, mirrors Hungary's system | Lowest mean, indicating drop-off rates post-primary education |

In conclusion, Cluster 1, represented by Hungary, frequently exhibits the highest average values across the educational measures, pointing to a robust and comprehensive educational framework. Cluster 0, represented by Morocco, has strong enrollment numbers but struggles to maintain quality and to reduce dropout rates. Cluster 2, represented by Guinea, presents the greatest difficulties, with low enrollment rates, high repetition rates, and many over-age pupils, which emphasizes the need for focused educational reforms. Clusters 0 and 1 share strengths in enrollment, while their differences, especially Cluster 1's higher results, highlight the contrast in educational quality and system efficiency between the Morocco and Hungary clusters. Cluster 2 is set apart by metrics that highlight urgent educational needs.

II. Hierarchical Clustering


Choosing the right threshold based on the dendrogram of Ward's Method

In [22]:
dendro_original
Out[22]:
No description has been provided for this image
In [23]:
dendrogram_levelled
Out[23]:
No description has been provided for this image

The dendrogram suggests two main groups, because the distance jumps sharply when going from two clusters to one. This big jump often means that forcing these two clusters together would be a bad fit: they are simply too different, which could imply that one cluster is quite different from, or an 'outlier' relative to, the other. Another way to look at this is that these two clusters clearly distinguish countries by the state of their early education.

We might be looking at countries with unique educational systems or challenges that set them apart from the rest. That's pretty crucial to know because our goal is to identify different educational profiles for better policy and investment decisions. The clusters help us tailor our recommendations for each group based on their shared traits.

Since the largest gap between successive merges lies between distances of 310 and just above 500, we chose our threshold between 310 and 500.
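Cutting a Ward dendrogram at a distance threshold can be sketched with SciPy as below. The toy data and the rule of placing the threshold midway between the last two merge heights are assumptions for illustration; the report instead reads its threshold (between 310 and 500) off the plotted dendrogram.

```python
import numpy as np
from scipy.cluster.hierarchy import fcluster, linkage

rng = np.random.default_rng(0)
# Two well-separated toy groups of 30 points each.
X_toy = np.vstack([rng.normal(0.0, 1.0, size=(30, 3)),
                   rng.normal(20.0, 1.0, size=(30, 3))])

Z = linkage(X_toy, method="ward")
heights = Z[:, 2]                               # merge distances, one per merge
threshold = (heights[-2] + heights[-1]) / 2     # midway between the last two merges
labels = fcluster(Z, t=threshold, criterion="distance")
print(len(set(labels)))
```

Any threshold inside that final gap yields the same two clusters, which is why a range rather than a single value is quoted above.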

In [24]:
ward
Out[24]:
No description has been provided for this image
In [25]:
comp_3d

The scatter plots further show how well the clustering worked. The separation is clearly visible, but there are still overlaps. Additionally, the violet cluster is not compact while the yellow cluster is, which strengthens our hypothesis that one cluster acts as a catch-all for outliers rather than being a genuine cluster. The result shows both balance and parsimony, since we used only 2 clusters and both have an almost equal number of points.

Range Plot for different features in Ward's Method

In [26]:
ward_range
Out[26]:
No description has been provided for this image
| Indicator Code | Indicator Name | Analysis for Cluster 0 | Analysis for Cluster 1 |
| --- | --- | --- | --- |
| SE.COM.DURS | Compulsory education duration | Broader range peaking at 16 years | Shorter and more consistent duration |
| SE.ENR.PRIM.FM.ZS | Gross intake ratio in first grade of primary education, female | Strong start for female primary education | Slightly higher average intake ratio |
| SE.ENR.SECO.FM.ZS | Gross intake ratio in first grade of secondary education, male | Greater variability | More uniform intake |
| SE.PRM.AGES | Primary school starting age | Wider range of starting ages | Average start around age 6 |
| SE.PRM.DURS | Primary education duration | Averages a longer duration | - |
| SE.PRM.ENRL.FE.ZS | Primary education, pupils % female | Higher peak of female pupils | Similar average percentage |
| SE.PRM.ENRR | School enrollment, primary % gross | - | Higher average gross primary enrollment |
| SE.PRM.ENRR.FE | School enrollment, primary, female % gross | Wider spread and higher maximum value | Averages a higher rate of female enrollment |
| SE.PRM.GINT.FE.ZS | Gross intake ratio in first grade of primary education, female | Larger range | Consistently higher average intake ratio |
| SE.PRM.NENR | School enrollment, primary % net | - | Notably higher average net enrollment |
| SE.PRM.OENR.ZS | Over-age students, primary % of enrollment | More significant challenges with age-grade distortion | Significantly lower average |
| SE.PRM.PRIV.ZS | School enrollment, primary, private % of total primary | Higher reliance on private education | - |
| SE.PRM.REPT.ZS | Repeaters, primary % of total enrollment | Higher and more variable repeater rate | Fewer students repeating grades |
| SE.SEC.ENRL.FE.ZS | Secondary education, general pupils % female | - | Higher average enrollment rate for females |
| SE.SEC.ENRR | School enrollment, secondary % gross | - | Higher average gross enrollment rate |

Cluster 1, encompassing countries like Argentina, Albania, Armenia, Austria, Turkey, Ukraine, Uzbekistan, and Venezuela, is characterized by consistent and higher educational rates. This cluster demonstrates shorter, more uniform compulsory education durations and a higher intake of females in both primary and secondary education, indicating a focus on gender parity. Additionally, higher net primary enrollment rates, fewer over-aged students, and a lower rate of repeaters in these countries suggest efficient educational progress, despite the diverse challenges specific to each nation's context, such as rural access or political influences.

In contrast, Cluster 0 includes countries like Burundi, Burkina Faso, Dominican Republic, Djibouti, Bhutan, Ethiopia, Ghana, India, Morocco, and Laos, and exhibits a broader range in educational metrics. This cluster's diverse educational systems are reflected in longer durations of primary education and varied starting ages, indicative of different national policies. The higher reliance on private schooling and the variable repeater rates within these countries point to challenges in the public education system, particularly in developing nations where educational quality and access remain key issues.

In summary, while Cluster 1 represents a more uniform, efficient educational system, indicating stronger policy implementations and emphasis on gender parity, Cluster 0 reveals a landscape of diverse educational challenges and approaches. Each cluster, despite its overarching characteristics, comprises countries with unique educational strengths and weaknesses. This distinction underscores the need for nuanced understanding and targeted educational strategies, recognizing the potential models in Cluster 1 and addressing specific needs highlighted by the diversity in Cluster 0.

III. Density-Based Clustering

In [27]:
dbscatter
Out[27]:
No description has been provided for this image
In [28]:
db_3d

Hierarchical clustering using Ward's method identifies two clusters. Density-based clustering, by contrast, suggests a single prominent cluster, with the remaining data points classified as outliers. Notably, many of these outliers form a cluster of their own within the hierarchical framework: the violet cluster in hierarchical clustering closely resembles the outliers of density-based clustering.

This discrepancy highlights the distinct methodologies of the clustering techniques: hierarchical clustering's approach tends to group data based on global patterns, while density-based clustering focuses on dense regions of data points.


Comparing the scatter plots for hierarchical and density-based clustering, we can see that they are nearly identical, with the outliers in DBSCAN forming a cluster in hierarchical clustering. To dig deeper, we apply a cluster purity test to measure how closely the two results match. For this test, we treat the outliers as a cluster of their own.
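The contrast between the two methods can be illustrated with a small sketch. Everything here is an assumption made for illustration: the synthetic data (one dense blob plus a sparse scatter), the `eps` and `min_samples` values, and the variable names are hypothetical and are not taken from the lab notebooks. It uses scikit-learn's `DBSCAN`, which labels noise points as `-1`, and `AgglomerativeClustering` with Ward linkage.

```python
import numpy as np
from sklearn.cluster import DBSCAN, AgglomerativeClustering

# Hypothetical data: one dense blob and a sparse, spread-out group.
rng = np.random.default_rng(1)
dense = rng.normal(loc=0.0, scale=0.3, size=(80, 2))
sparse = rng.uniform(low=5.0, high=15.0, size=(20, 2))
X = np.vstack([dense, sparse])

# DBSCAN marks points in low-density regions as noise (label -1).
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# Ward linkage is asked for two clusters, so every point gets a cluster.
ward_labels = AgglomerativeClustering(n_clusters=2, linkage="ward").fit_predict(X)

# Treat the DBSCAN outliers as a cluster of their own for comparison.
db_as_clusters = np.where(db_labels == -1, db_labels.max() + 1, db_labels)

n_noise = int(np.sum(db_labels == -1))
print(f"DBSCAN noise points: {n_noise}")
print(f"Ward cluster sizes: {np.bincount(ward_labels)}")
```

Relabeling the noise points gives both methods a complete set of cluster labels, which is what a purity comparison needs.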

IV. Cluster Purity Test

In [29]:
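import numpy as np  # confusion_matrix is imported in the first cell


def purity(y_pred_db, y_pred_ward):
    """Compute the purity of one clustering relative to another

    Parameters
    ----------
    y_pred_db : array
        Density-based cluster labels, used as the reference classes
        (outliers count as a class of their own)
    y_pred_ward : array
        Ward hierarchical cluster labels

    Returns
    -------
    purity : float
        Fraction of points that fall in the majority reference class
        of their Ward cluster
    """
    # Rows are the density-based labels, columns are the Ward labels
    matrix = confusion_matrix(y_pred_db, y_pred_ward)

    # Sum the largest count in each column, then divide by the total
    return np.sum(np.amax(matrix, axis=0)) / np.sum(matrix)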
In [30]:
score = purity(y_predict_country_com, cluster_labels)

The cluster purity test comparing the hierarchical and density-based clustering results yielded a score of 81.37%. This means that 81.37% of all data points fall in the majority class of their cluster, indicating a high degree of homogeneity within the clusters. Because the score compares one clustering against the other, it shows that the two methods grouped the data points in largely the same way, with most points in each cluster sharing common characteristics or features.

This level of purity is significant, especially considering the complexities and potential irregularities inherent in real-world datasets. It suggests that both hierarchical and Density-Based methods are adept at identifying and grouping similar data points, even in the absence of predefined cluster boundaries or assumptions about the data distribution.

This agreement arises because the Ward linkage used for hierarchical clustering minimizes the increase in within-cluster variance rather than applying a strict density threshold as DBSCAN does. Ward's method can therefore recognize sparser groups of points as valid clusters, whereas DBSCAN may label them as outliers because of their lower density. Essentially, Ward's linkage prioritizes overall cluster cohesion over density, enabling it to classify less dense but meaningful groups of data points as clusters.
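To make the purity score concrete, here is a small worked example on toy labels. It is a dependency-free sketch that should be equivalent to the confusion-matrix version; the function name `purity_from_labels` and the toy label lists are hypothetical, chosen only for illustration.

```python
from collections import Counter, defaultdict


def purity_from_labels(reference, clusters):
    """Fraction of points in the majority reference class of their cluster."""
    # Group the reference labels by the cluster each point was assigned to.
    groups = defaultdict(list)
    for ref, clu in zip(reference, clusters):
        groups[clu].append(ref)

    # For each cluster, count the points in its most common reference class.
    majority_total = sum(
        Counter(members).most_common(1)[0][1] for members in groups.values()
    )
    return majority_total / len(reference)


# Toy labels: -1 marks density-based outliers, treated as their own class.
db_toy = [0, 0, 0, 0, -1, -1, -1, 0]
ward_toy = [0, 0, 0, 0, 1, 1, 1, 1]

toy_score = purity_from_labels(db_toy, ward_toy)
print(toy_score)  # 7 of 8 points agree with their cluster's majority: 0.875
```

In this toy case, Ward cluster 0 holds four points that all share reference class 0, while Ward cluster 1 holds three outliers and one point of class 0, so 4 + 3 = 7 of the 8 points count toward purity.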


V. Conclusion and Recommendation

No description has been provided for this image


After analyzing the educational data from 2011, our representative clustering has pinpointed key differences and similarities in education systems worldwide. Morocco and Hungary, although having high enrollment rates, differ in their ability to keep students in school past primary education, especially girls. Guinea is dealing with more basic issues, struggling to even get kids consistently through the primary level. By offering actions aligned with our findings, we can help countries within each group to improve their education systems. This isn't just about hitting enrollment targets but making sure that every child gets a quality education and a real shot at a better future.

As for our analysis of hierarchical and density-based clustering, the purity test comparing the two methods gave us a solid score of 81.37%. This means that a large majority of the points are grouped the same way by both methods, showing that they largely agree on which data points belong together. What's interesting is how the hierarchical method, especially with Ward linkage, picked up on the outliers that DBSCAN found and treated them as a separate cluster. This shows us that combining both methods gives us a fuller picture, making it a smart move to use them together in our analysis for more accurate and detailed results.

Cluster 1, featuring countries like Argentina and Austria, shows consistent and higher educational rates with a focus on gender parity and efficient educational progress. However, its diverse challenges, such as rural education access in Austria, illustrate the unique contexts within the cluster. Cluster 0, whose members are also the outliers in DBSCAN and which includes nations like India and Burkina Faso, displays a broader range in educational metrics, indicative of diverse educational systems and challenges in public education, like access and quality.

While these clusters provide an overview of more developed (Cluster 1) versus developing (Cluster 0) education systems, it's crucial to remember that each country within a cluster has unique educational strengths and weaknesses. This underscores the need for a deeper, more nuanced understanding of each country's specific educational landscape to inform targeted and effective educational strategies.


Implication in the Philippines Context

Considering the educational challenges highlighted in the Philippines, including infrastructure deficiencies, private school closures, and the quality of learning outcomes, it's plausible to suggest that the Philippines might align with the cluster represented by Guinea. This cluster signifies educational systems grappling with fundamental issues that significantly impact the quality and inclusivity of education.

The Philippines' Basic Education Report 2023, as reported by iTacloban, revealed substantial infrastructural deficits, with a considerable number of school buildings requiring major repairs or marked for condemnation. The lack of facilities and resources has also been underscored as a critical concern. These challenges mirror those seen in Guinea's cluster, where systemic issues in infrastructure and resource provision hinder the educational process.

Furthermore, issues with the procurement process, a decline in enrollment in private schools, and concerns over curriculum and employability resonate with the broader challenges identified in Guinea's cluster. The Philippines' education sector is also dealing with the aftermath of prolonged school closures and the need for a significant curriculum overhaul, suggesting a potential match with Guinea's cluster profile.

The current state of the Philippines' education system suggests the need for targeted interventions similar to those recommended for Guinea's cluster. These would include infrastructural investments, curriculum reforms to address 21st-century skills, and strategies to improve the overall quality and accessibility of education.


VI. REFERENCES

No description has been provided for this image
  • Amador III, J. (2023, February). The Philippines’ Basic Education Crisis. The Diplomat. Retrieved February 11, 2024, from https://thediplomat.com/2023/02/the-philippines-basic-education-crisis/

  • Broken Chalk. (n.d.). Beyond the Medina: Unpacking Morocco’s Educational Challenges. Broken Chalk. Retrieved February 11, 2024, from https://brokenchalk.org/beyond-the-medina-unpacking-moroccos-educational-challenges/

  • Broken Chalk. (n.d.). Challenges in Guinea's Education System. Broken Chalk. Retrieved February 11, 2024, from https://brokenchalk.org/challenges-in-guineas-education-system/

  • ChatGPT. (2024). Assistance with analysis and phrasing for educational systems project. OpenAI. Retrieved February 11, 2024, from https://openai.com/chatgpt

  • European Commission. (n.d.). Hungary Overview. Eurydice - European Commission. Retrieved February 11, 2024, from https://eurydice.eacea.ec.europa.eu/national-education-systems/hungary/overview

  • iTacloban. (2023, January). Basic Education Report 2023. Retrieved February 11, 2024, from https://www.itacloban.com/2023/01/basic-education-report-2023.html

  • OECD. (n.d.). Education at a Glance 2021: OECD Indicators. OECD iLibrary. Retrieved February 11, 2024, from https://www.oecd-ilibrary.org/docserver/9789264273344-6-en.pdf

  • PBEd. (2023). State of Philippine Education Report 2023. Philippine Business for Education. Retrieved February 11, 2024, from https://pbed.ph/blogs/47/PBEd/State%20of%20Philippine%20Education%20Report%202023

  • Scholaro. (n.d.). Morocco Education System. Scholaro. Retrieved February 11, 2024, from https://www.scholaro.com/pro/Countries/Morocco/Education-System

  • Soleymani, A. (n.d.). Beyond scikit-learn: Is it time to retire K-means and use this method instead? Medium. Retrieved February 11, 2024, from https://medium.com/@ali.soleymani.co/beyond-scikit-learn-is-it-time-to-retire-k-means-and-use-this-method-instead-b8eb9ca9079a

  • The World Bank. (n.d.). Morocco Overview. The World Bank. Retrieved February 11, 2024, from https://www.worldbank.org/en/country/morocco/overview